Model Selection

Multimodal Visual Question Answering

# Multimodal Visual Question Answering

Qwen2.5 VL 72B Instruct FP8 Dynamic

FP8 quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, optimized and released by Neural Magic.

Transformers English

Qwen2.5 VL 3B Instruct Quantized.w4a16

The quantized version of Qwen2.5-VL-3B-Instruct, with weights quantized to INT4 and activations quantized to FP16, designed for efficient vision-text task inference.

Transformers English

Qwen2.5 VL 72B Instruct FP8 Dynamic

The FP8 quantized version of Qwen2.5-VL-72B-Instruct, supporting vision-text input and text output, suitable for multimodal tasks.

Transformers English

Qwen2 VL 7B Instruct GGUF

A quantized version of the multimodal model based on Qwen2-VL-7B-Instruct, supporting image-text-to-text tasks with various quantization levels.

Image-to-Text English

Erax VL 7B V2.0 Preview GGUF

EraX-VL-7B-V2.0-Preview is a multimodal foundation model supporting Vietnamese, English, and Chinese, suitable for various vision-language tasks.

Image-to-Text Supports Multiple Languages

Erax VL 2B V1.5 Q4 K M GGUF

This is a multimodal visual question answering model supporting Vietnamese, English, and Chinese, converted to GGUF format based on erax-ai/EraX-VL-2B-V1.5.

Text-to-Image Supports Multiple Languages

QVQ 72B Preview GGUF

QVQ-72B-Preview is a multimodal large language model based on the imatrix quantization version of llamacpp, supporting multimodal understanding and generation of images and text.

Text-to-Image English

Qwen2 VL 7B Instruct GGUF

Qwen2-VL-7B-Instruct is a multimodal vision-language model that supports joint understanding and generation tasks for images and text.

Image-to-Text English

Paligemma2 28b Pt 896

PaliGemma 2 is a Vision-Language Model (VLM) launched by Google, combining the capabilities of the Gemma 2 language model and SigLIP vision model, supporting image and text inputs to generate text outputs.

Paligemma2 3b Mix 224

PaliGemma 2 is an upgraded vision-language model developed by Google, combining the capabilities of Gemma 2, supporting image and text inputs to generate text outputs, suitable for various vision-language tasks.

Minicpm Llama3 V 2 5 GGUF

MiniCPM-Llama3-V-2_5 is a multimodal visual question answering model based on the Llama3 architecture, supporting both Chinese and English interactions.

Text-to-Image Supports Multiple Languages

Llama 3.1 8B Vision 378

This project trained a projection module to add visual capabilities to Llama 3 using SigLIP technology, applied to the Llama-3.1-8B-Instruct model.

Yi-VL-6B is a multimodal vision-language model developed by 01-AI, supporting both Chinese and English, suitable for tasks like visual question answering.

Transformers Supports Multiple Languages

Paligemma 3b Ft Science Qa 448

PaliGemma is a 3B-parameter lightweight vision-language model developed by Google, built upon SigLIP vision model and Gemma language model, supporting image and text inputs to generate text outputs.

Paligemma 3b Mix 448

PaliGemma is a versatile lightweight vision-language model (VLM) built upon the SigLIP vision model and Gemma language model, supporting image and text inputs to generate text outputs

Paligemma 3b Ft Docvqa 896

PaliGemma is a lightweight vision-language model developed by Google, built on the SigLIP vision model and the Gemma language model, supporting multilingual image-text understanding and generation.

Paligemma 3b Ft Vqav2 448

PaliGemma is a lightweight vision-language model developed by Google, combining image understanding and text generation capabilities, supporting multilingual tasks.

Paligemma 3b Ft Ocrvqa 448

PaliGemma is a versatile lightweight vision-language model (VLM) developed by Google, built on the SigLIP vision model and Gemma language model, supporting both image and text inputs with text outputs.

FireLLaVA-13B is a vision-language model trained on instruction data generated by open-source large language models, supporting image understanding and text generation tasks.

Featured Recommended AI Models

AIbase

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご

© 2025AIbase